Abstract

(Extended Abstract) 


1 Motivation 

Protein structure prediction plays an important role in bioinformatics and
biology. Traditional protein structure prediction approaches include
template-based modeling (TBM, including homology modeling and threading)
and free modeling (FM). In particular, a threading algorithm takes a query
protein sequence as input, recognizes the most likely fold, and finally reports
the alignments of the query sequence to templates of known structure as output.
Existing threading approaches mainly use protein sequence profiles, solvent
accessibility, contact probability, etc. The threading strategy has been shown
to be successful in predicting the structure of a large number of proteins;
however, existing threading approaches perform poorly on remote homology
proteins. How to improve fold recognition for remote homology proteins remains
a challenge in protein structure prediction.

The sequences of remote homology proteins generally carry a relatively weak
signal of structure. However, this does not mean that the sequences contain no
conservation hints for structure. The success of the multiple-template strategy
implies the existence of common frameworks, i.e., regions of proteins that are
conserved in both structure and sequence. Such common frameworks should be
responsible for structural stability and hence conserved during evolution.

Based on this observation, we propose a novel threading approach in three
steps. First, for each template, the common structural frameworks shared by its
homologous proteins are calculated. Second, unlike traditional threading
methods, where the alignment is made against the whole template, we first align
the query protein sequence against a common framework. This strategy avoids a
drawback of the traditional threading approach, namely that aligning the
variable regions beyond the conserved motifs is prone to introducing errors.
Third, the final alignments are generated by aligning the query sequence
against candidate full-length templates in the family. Briefly, we run
TreeThreader[2] to build alignments of the query against the new template
database and rank the alignments by E-value for model generation. Finally, we
generate models with MODELLER based on the candidate alignments and rank the
generated models according to the dDFIRE[3] energy function.

2 Methods 

For each template with known structure, all of its remote homology proteins are
first identified based on structure alignment. Then, a linear program was
designed to identify the common framework shared by these remote homology
proteins.

The common framework identification problem can be described as follows: given
a collection of homologous proteins $H = \{s_1, \ldots, s_N\}$ with lengths
$L_1, \ldots, L_N$, the objective is to find $m$ segments of length $n$ with
high sequence conservation and structural similarity. As an example, Fig. 1
shows the common frameworks shared by protein 3gxr_A and its homologous
proteins.

2.1 Basic idea of the linear program 

The common framework poses two requirements, i.e., significantly high
sequence conservation and structural similarity. In the linear program, the
objective function was designed to describe structural similarity, and the
constraints were designed to describe sequence similarity.

Specifically, the linear program uses a set of Boolean variables to represent
the locations of the conserved segments: $x_{i,j,k} = 1$ denotes that, in the
$i$th protein, the $k$th segment is located at the $j$th residue. The
structural similarity objective and the sequence similarity constraints can
then be described using $x_{i,j,k}$.

Figure 1: Common frameworks shared by protein 3gxr_A and its homologous
proteins. The common framework consists of three dispersed segments (in yellow,
cyan, and green). At the conserved segments, the homologous proteins
display significant sequence conservation and structural similarity.

The constraints were designed to represent the following requirements. 

• For any sequence, the $k$th segment in the common framework is unique;

• No two segments in a common framework overlap or cross;

• The segments should have significantly high sequence similarity.

The integer linear programming model can be described as: 
where $L_i$ denotes the length of the $i$th protein, and $M$ denotes the
pre-calculated sequence similarity matrix. In particular, the cell
$M_{i_1,j_1,i_2,j_2}$ denotes the sequence similarity between the segment
starting from $j_1$ in the $i_1$th protein and the segment starting from $j_2$
in the $i_2$th protein.
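The three requirements listed above can be checked directly on a candidate placement of segments. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the 0-based `starts[i][k]` layout, the use of fractional sequence identity as a stand-in for a cell of $M$, and the threshold `tau` are all hypothetical choices made for illustration.

```python
from itertools import combinations


def segment_identity(seq_a, start_a, seq_b, start_b, n):
    """Fraction of identical residues between two length-n segments.

    This is only a simple stand-in for one cell of the similarity matrix M.
    """
    a = seq_a[start_a:start_a + n]
    b = seq_b[start_b:start_b + n]
    return sum(x == y for x, y in zip(a, b)) / n


def is_valid_framework(seqs, starts, n, tau):
    """Check a candidate framework against the three requirements.

    starts[i][k] is the 0-based start of the k-th segment in protein i
    (hypothetical layout); tau is an assumed similarity threshold.
    """
    m = len(starts[0])
    for seq, s in zip(seqs, starts):
        # Uniqueness: every protein places each of the m segments exactly once.
        if len(s) != m:
            return False
        # Every segment must fit inside the protein.
        if any(j < 0 or j + n > len(seq) for j in s):
            return False
        # No overlap and no crossing: segment k+1 starts after segment k ends.
        if any(s[k] + n > s[k + 1] for k in range(m - 1)):
            return False
    # Sequence similarity: the k-th segments of every protein pair must match.
    for i1, i2 in combinations(range(len(seqs)), 2):
        for k in range(m):
            sim = segment_identity(seqs[i1], starts[i1][k],
                                   seqs[i2], starts[i2][k], n)
            if sim < tau:
                return False
    return True
```

For example, with the toy sequences `"AAKLMGGWPQRST"` and `"KLMTTTWPQRAAA"`, segment length 3, and starts `[[2, 7], [0, 6]]`, both proteins place `KLM` first and `WPQ` second, so the placement passes; swapping the order of the two segments violates the no-crossing requirement.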



2.2 Refining the ILP model 


In our model, the structural similarity is described using Dscore [1].
where $a_{ij}$ and $b_{ij}$ denote the Cα distance between residue $i$ and
residue $j$ in protein A and protein B, respectively.
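The quantities $a_{ij}$ and $b_{ij}$ are intra-protein Cα distances. A minimal sketch of how such distance matrices could be computed from coordinates is given below, assuming each protein is supplied as a list of `(x, y, z)` tuples, one per residue (a hypothetical input format). The comparison function at the end is a simple illustrative stand-in and is not the Dscore of [1].

```python
import math


def ca_distance_matrix(ca_coords):
    """Return the matrix of pairwise distances between C-alpha atoms.

    ca_coords is a list of (x, y, z) tuples, one per residue
    (hypothetical input format).
    """
    return [[math.dist(p, q) for q in ca_coords] for p in ca_coords]


def mean_abs_distance_difference(a, b):
    """Illustrative stand-in only (NOT Dscore).

    Returns the mean of |a_ij - b_ij| over all residue pairs i < j of two
    equally sized distance matrices.
    """
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("matrices must have equal size and at least two residues")
    diffs = [abs(a[i][j] - b[i][j]) for i in range(n) for j in range(i + 1, n)]
    return sum(diffs) / len(diffs)
```

Because both matrices are built from distances within a single protein, they can be compared without first superimposing the two structures.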

The final integer linear program can be formulated as:
where the first indicator variable equals 1 iff all four variables
$x_{i_1,j_{11},k_1}$, $x_{i_1,j_{12},k_2}$, $x_{i_2,j_{21},k_1}$, and
$x_{i_2,j_{22},k_2}$ equal 1, and 0 otherwise. The second indicator variable
equals 1 iff both $x_{i_1,j_1,k}$ and $x_{i_2,j_2,k}$ equal 1. The $D$ and $M$
matrices are calculated in advance. The cell
$D_{i_1,j_{11},j_{12},i_2,j_{21},j_{22}}$ denotes the approximated Dscore
between the segments starting from $j_{11}$ and $j_{12}$ in the $i_1$th protein
and the segments starting from $j_{21}$ and $j_{22}$ in the $i_2$th protein.
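An indicator that equals 1 exactly when several binary variables all equal 1 is a product of binaries, and such products are commonly expressed with linear inequalities. The block below shows this standard device as a sketch; the names $y$ and $x_1, \ldots, x_4$ are shorthand introduced here for illustration, and the text does not state which linearization was actually used.

```latex
% y = 1 iff x_1 = x_2 = x_3 = x_4 = 1 (all variables binary)
\begin{align*}
y &\le x_a, \quad a = 1, \dots, 4, \\
y &\ge x_1 + x_2 + x_3 + x_4 - 3, \\
y &\in \{0, 1\}.
\end{align*}
```

The upper bounds force $y = 0$ whenever any $x_a = 0$, and the lower bound forces $y = 1$ when all four equal 1. The two-variable indicator follows the same pattern with $y \ge x_1 + x_2 - 1$.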













Figure 2: Common frameworks shared by two domains in SCOP family c.37.1.11.

2.3 An example 

Fig. 2 shows the common frameworks shared by two domains in SCOP family
c.37.1.11. The common frameworks have a Dscore of 8.35 and an RMSD of 1.9 Å,
implying significantly high structural conservation.

3 Experiments before CASP 11 

For a total of over 27,000 proteins in PDB70 (as updated on Apr. 19, 2014), the
common frameworks were identified to yield a database called TOPO. The test
set consists of 142 pairs of proteins that are similar in structure but have
low sequence identity. Traditional threading approaches, such as HHpred, fail
to build an accurate alignment between such protein pairs. In contrast, our
alignment method successfully builds accurate alignments (TMscore > 0.4) for
seven protein pairs and generates accurate contact information for 45 protein
pairs. Take the pair 3dzl_A vs. 1twd_A as an example. The two proteins share
a similar structure (TMscore = 0.56); however, the alignment generated
by HHpred has a TMscore of only 0.22. In contrast, our alignment method
generates an alignment with TMscore = 0.43.

4 Conclusions 

Unlike close homology proteins, remote homology proteins show weak overall
sequence signals of structural similarity. However, they still share common
frameworks, which carry strong sequence signals of structural similarity.
Aligning against the common frameworks instead of whole protein sequences
improves fold recognition.